Whistle Bias? Investigating Referee Influence on WNBA Away Game Outcomes

INFO 523 - Final Project

This project aims to investigate potential officiating bias in WNBA games by analyzing referee crew assignments, foul distributions, and game outcomes. The primary objective is to determine whether certain referee combinations disproportionately favor home teams or exhibit consistent patterns of foul disparities.
Author: Amy Esplain
Affiliation: College of Information Science, University of Arizona

Abstract

The Women’s National Basketball Association (WNBA) has experienced significant growth in recent years, accompanied by an increasing emphasis on data analytics to enhance forecasting and anomaly detection capabilities. This project evaluates the fairness of officiating in the WNBA by applying data mining techniques to referee assignment data, foul differentials, and game outcomes across multiple seasons. The primary objective is to identify potential officiating bias and assess the extent to which individual referees may contribute to a home-court advantage.

Research Question

The research questions guiding this project are designed to uncover patterns in officiating behavior within the WNBA using unsupervised data mining techniques. Rather than testing predefined hypotheses, the goal is to explore underlying structures and trends in referee decision-making that may indicate systemic tendencies or inconsistencies.

  1. Do home teams have significantly higher win rates under specific referee crews?
  2. Do games officiated by certain referee combinations show higher or lower foul disparity?
  3. Do certain referees call more fouls on away teams?

Dataset

The analysis uses a granular play-by-play dataset so that each foul called within a game can be tied to an identifiable referee. The dataset was published on Kaggle by Vladislav Shufinskiy [1], who combined several sources into a set of publicly available datasets. Using it avoids the per-game request limits that APIs impose on play-by-play details.

General Game Data Overview

The dataset consists of 65 games spanning August 2022 to October 2024. A typical WNBA regular season runs from May to September, with each team playing around 40 games. This dataset, however, covers only 65 games in total across the period and includes 11 teams rather than the full 12 active teams. The dataset is therefore incomplete, and any insights drawn from this analysis should be interpreted with caution, as they may not represent league-wide trends or season-long dynamics.

Number of games: 65
Date range: August 2022 to October 2024
Number of teams: 11
Teams: ['ATL', 'CHI', 'CON', 'DAL', 'IND', 'LVA', 'MIN', 'NYL', 'PHO', 'SEA', 'WAS']

Outcomes by Season:
2022: 23 games, 12 home wins, average point difference: 5
2023: 20 games, 13 home wins, average point difference: 9
2024: 22 games, 17 home wins, average point difference: 6
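
The season counts above can be turned into home win rates, which is the baseline against which any referee-specific effect would be compared. The following is a minimal sketch that uses only the numbers in the listing; the variable and function names are illustrative and not taken from the project code.

```python
# Season summaries copied from the "Outcomes by Season" listing:
# season -> (games, home wins)
season_outcomes = {2022: (23, 12), 2023: (20, 13), 2024: (22, 17)}


def home_win_pct(games, home_wins):
    """Return the home win rate as a percentage rounded to one decimal."""
    return round(100 * home_wins / games, 1)


home_win_rates = {
    season: home_win_pct(games, wins)
    for season, (games, wins) in season_outcomes.items()
}

total_games = sum(games for games, _ in season_outcomes.values())
total_home_wins = sum(wins for _, wins in season_outcomes.values())
overall_home_win_rate = home_win_pct(total_games, total_home_wins)

print(home_win_rates)
print(total_games, total_home_wins, overall_home_win_rate)
```

With these counts the home win rate is roughly 52% in 2022, 65% in 2023, and 77% in 2024, or about 65% across all 65 games. Each season has only about 20 games in this sample, so the year-to-year swing should be read cautiously.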

Summary of Referee Assignments

A total of 27 unique referees appear across these games, with an average of 3.05 referees per game, which aligns with the standard three-person officiating crew format in the WNBA. The slight excess over 3.0 suggests that a few games carry an additional referee record, possibly because of substitutions or overtime data. The most active referees are defined by the number of distinct games in which they appear. Seven referees appear more than 15 times across the 65-game dataset (Figure 1).

Games with referee data: 65 out of 65 total games (100.0%)
Records with referee data: 2,793 (9.9%)
Unique referees: 27
Average referees per game: 3.05
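
A summary like the one above could be derived from the play-by-play rows roughly as follows. This is a sketch on toy data, not the project code: the column names `game_id` and `referee_id` are hypothetical stand-ins for whatever the Kaggle files use.

```python
import pandas as pd

# Toy play-by-play rows; column names are hypothetical.
pbp = pd.DataFrame({
    "game_id":    [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "referee_id": ["A", "B", "C", None, "A", "B", "D", "E", "A"],
})

# Keep only rows where a referee is credited with the call.
whistles = pbp.dropna(subset=["referee_id"])

# Appearances = number of distinct games in which a referee shows up.
appearances = (
    whistles.groupby("referee_id")["game_id"]
    .nunique()
    .sort_values(ascending=False)
)

# Distinct referees per game, then the average across games.
refs_per_game = whistles.groupby("game_id")["referee_id"].nunique()
avg_refs_per_game = refs_per_game.mean()

print(appearances)
print(avg_refs_per_game)
```

In the toy data the second game has four distinct referees, which is the kind of record that would push the average above 3.0.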

Missing Values Analysis

Referee ID data is missing across most action types. Core gameplay events such as shots, rebounds, substitutions, steals, and blocks have 100% missing referee IDs, which reflects structural design rather than data errors since these actions don’t require official attribution (Table 1). In contrast, fouls (0% missing) and violations (8% missing) consistently record referee IDs, making them reliable categories for analyzing officiating behavior (Table 1).

Among turnovers (56% missing), the subtypes show that missing referee IDs are tied to the nature of the event. Subtypes such as bad passes (100% missing), lost balls (98.7%), and other unforced errors almost never log a referee, reflecting that these turnovers occur without a whistle.

In contrast, whistle-driven subtypes such as offensive fouls, traveling, double dribble, 5-second, 8-second, and inbound violations have referee IDs recorded always or almost always (0% to 1% missing; Table 2). Intermediate categories like 3-second violations (17.6% missing), backcourt (11.1%), shot clock (10.9%), and out-of-bounds (4.5%) have high but not perfect coverage, likely due to inconsistent logging (Table 2). Referee attribution is therefore reliable only for whistle-based turnovers, which are the turnovers used in this analysis.

Table 1: Action Type Breakout
              total_count  referee_count  percent_missing
actionType                                               
2pt                  6022            0.0            100.0
3pt                  2940            0.0            100.0
block                 566            0.0            100.0
freethrow            2044            0.0            100.0
game                   65            0.0            100.0
jumpball              191            0.0            100.0
substitution         5070            0.0            100.0
period                528            0.0            100.0
rebound              5416            0.0            100.0
steal                 876            0.0            100.0
timeout               664            0.0            100.0
turnover             1631          711.0             56.4
violation              87           80.0              8.0
foul                 2003         2002.0              0.0


Table 2: Missing Referee ID Values for Turnover Subtypes
                       total_count  referee_count  percent_missing
subType                                                           
bad pass                       582            0.0            100.0
jumpball violation               2            0.0            100.0
offensive-kicked-ball            1            0.0            100.0
lost ball                      305            4.0             98.7
3-second-violation              17           14.0             17.6
backcourt                        9            8.0             11.1
shot clock                     129          115.0             10.9
out-of-bounds                  337          322.0              4.5
traveling                       96           95.0              1.0
5-second-violation               2            2.0              0.0
double dribble                   2            2.0              0.0
8-second-violation               2            2.0              0.0
offensive foul                 142          142.0              0.0
inbound                          1            1.0              0.0
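
Tables 1 and 2 share the same three columns, so both could come from one grouped summary. The sketch below shows one way to compute it with pandas. `actionType` and `subType` appear in the tables above, while `referee_id` and the helper name `missing_referee_summary` are assumptions made for illustration.

```python
import pandas as pd


def missing_referee_summary(pbp, by, ref_col="referee_id"):
    """Count rows and non-null referee IDs per group, plus percent missing."""
    grouped = pbp.groupby(by)[ref_col]
    summary = pd.DataFrame({
        "total_count": grouped.size(),     # all rows in the group
        "referee_count": grouped.count(),  # rows with a referee ID
    })
    summary["percent_missing"] = (
        100 * (1 - summary["referee_count"] / summary["total_count"])
    ).round(1)
    return summary.sort_values("percent_missing", ascending=False)


# Toy rows: fouls always carry a referee, shots never do.
pbp = pd.DataFrame({
    "actionType": ["foul", "foul", "turnover", "turnover", "2pt"],
    "referee_id": ["A", "B", "C", None, None],
})

summary = missing_referee_summary(pbp, "actionType")
print(summary)
```

Passing `"subType"` as `by` on the turnover rows alone would give the Table 2 layout.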

Feature Engineering

Using the original play-by-play data, two additional datasets were constructed:

Individual Referee–Game Level Dataset

Each row represents an individual referee’s involvement in a specific game. This dataset includes game-level information (scores, outcome, fouls, competitiveness) alongside referee-specific statistics such as the number of fouls and turnover violations they called, and a breakdown of turnover calls by subtype. This structure enables analysis of individual referee behavior across games. Additionally, referee IDs were mapped to letters to improve readability.

Figures 3 and 4 show sorted views of each referee’s average per-game difference in foul calls and whistle-turnover calls between away and home teams.
Referees who call more fouls or whistle turnovers on the away team appear on the left, and those who call the fewest appear on the right.
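
The per-referee Away − Home differential described above can be sketched as a count of calls per referee, game, and side, followed by a subtraction and an average. The column names `game_id`, `referee_id`, and `side` are hypothetical, and the toy rows are invented for illustration.

```python
import pandas as pd

# Toy whistle rows; "side" marks which team the call went against.
calls = pd.DataFrame({
    "game_id":    [1, 1, 1, 1, 2, 2, 2],
    "referee_id": ["A", "A", "A", "B", "A", "B", "B"],
    "side":       ["away", "away", "home", "home", "home", "away", "away"],
    "actionType": ["foul"] * 7,
})

fouls = calls[calls["actionType"] == "foul"]

# One row per (referee, game) with away and home foul counts side by side.
per_game = (
    fouls.groupby(["referee_id", "game_id", "side"])
    .size()
    .unstack("side", fill_value=0)
    .reindex(columns=["away", "home"], fill_value=0)
)
per_game["foul_diff_away_home"] = per_game["away"] - per_game["home"]

# Average differential per referee across the games they officiated.
avg_foul_diff = per_game.groupby("referee_id")["foul_diff_away_home"].mean()
print(avg_foul_diff)
```

A positive value means more fouls called on away teams. A referee with no recorded call in a game does not appear for that game in this sketch, which is one reason low-volume referees can show extreme averages.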

Referee Crew–Game Level Dataset

Each row represents the crew assigned to a particular game. Crews are defined as the set of referees officiating together. This dataset allows for the evaluation of crew-level dynamics such as whether certain combinations of officials are associated with higher foul counts or whistle turnovers. Additionally, each crew was labeled with the letters of its individual referees to improve readability.

Figures 5 and 6 show sorted views in which the referee crews that call more fouls or whistle turnovers on the away team appear on the left, and those that call the fewest appear on the right.
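
Because a crew is defined as the set of referees working a game together, its label should not depend on the order in which referees appear in the data. A minimal sketch of that idea, with hypothetical column names, sorts the referee letters within each game before joining them.

```python
import pandas as pd

# Toy whistle rows: games 1 and 2 share a crew listed in different orders.
whistles = pd.DataFrame({
    "game_id":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "referee_id": ["C", "A", "B", "B", "A", "C", "A", "D", "B"],
})

# Order-independent crew label per game.
crew_by_game = (
    whistles.groupby("game_id")["referee_id"]
    .agg(lambda refs: ", ".join(sorted(refs.unique())))
    .rename("crew")
)

# How many games each distinct crew officiated.
games_per_crew = crew_by_game.value_counts()

print(crew_by_game)
print(games_per_crew)
```

Table 4 reports about one game per crew in most clusters, so crew-level averages in this dataset are mostly single-game observations.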

Analysis Overview

The analysis profiles the individual referees (Figure 7) and the referee crews (Figure 8) based on their average whistle-turnover difference per game and their average foul difference per game. Both metrics are computed as away-team calls minus home-team calls. This profiling helps identify which referees and referee crews call more fouls and/or whistle-based turnovers on one side.

Each graph is divided into quadrants: the top right indicates more fouls and more turnovers called on away teams (stricter on away), while the bottom left represents fewer fouls and fewer turnovers called on the away team (stricter on home teams).
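
The quadrant reading can be written as a small labeling rule. The sketch below assumes the foul difference is on the horizontal axis and the whistle-turnover difference on the vertical axis, which is inferred from the quadrant descriptions in this report. The function name and the `tol` parameter are illustrative.

```python
def whistle_profile(foul_diff, turnover_diff, tol=0.0):
    """Label an (Away - Home) point by the quadrant it falls in.

    tol is a hypothetical dead zone around the origin treated as neutral.
    """
    if abs(foul_diff) <= tol and abs(turnover_diff) <= tol:
        return "neutral"
    if foul_diff > 0 and turnover_diff > 0:
        return "strict on away"          # top right
    if foul_diff > 0:
        return "foul-heavy on away"      # bottom right
    if turnover_diff > 0:
        return "turnover-heavy on away"  # top left
    return "lenient on away"             # bottom left


# Example points taken from the cluster means in Tables 3 and 4.
examples = {
    (4.08, 0.83): whistle_profile(4.08, 0.83),
    (0.43, -2.29): whistle_profile(0.43, -2.29),
    (-4.00, 5.33): whistle_profile(-4.00, 5.33),
    (-5.00, -1.00): whistle_profile(-5.00, -1.00),
}
for point, label in examples.items():
    print(point, label)
```

A strict sign rule labels points very close to an axis by tiny differences, so a nonzero `tol` may be preferable when most referees sit near the origin.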

Individual Referee Analysis

In Figure 7, most referees sit close to the origin, which represents neutral treatment of away and home teams.

  • Ref D and Ref V, in the top-right quadrant, call more fouls and more whistle turnovers on away teams (“strict on away”).

  • Ref O is also right of center with a positive whistle-turnover difference, suggesting a milder version of that strict pattern.

  • Ref B, in the bottom-right, calls more fouls on away teams but fewer whistle turnovers (“foul-heavy on away”).

  • Ref P, in the far bottom-left, calls fewer fouls and fewer whistle turnovers on away teams (“lenient toward the away”).

Referee Crew Analysis

In Figure 8, most referee crews cluster near the origin which implies little systematic difference between whistles on away vs. home teams.

Top-right (strict on away):

  • Crews (Ref H, Ref K, Ref N) and (Ref C, Ref M, Ref V) call more fouls and more whistle turnovers on away teams.

  • Crew (Ref D, Ref L, Ref N) is strongly foul-heavy on away with moderately more whistle turnovers.

Top-left (turnover-heavy on away):

  • Crew (Ref G, Ref J, Ref N) shows fewer fouls but more whistle turnovers on away teams.

Bottom area (lenient on away for turnovers):

  • One extreme crew (Ref C, Ref H, Ref V) has far fewer whistle turnovers on away teams with near-neutral fouls.

Choosing Number of Clusters

The number of clusters was chosen from the Calinski-Harabasz (CH) curves for both individual referees and referee crews, using the features described above:

  • Individual Referee Clusters (k): 6 clusters. In Figure 8, the CH curve jumps sharply from k=2 to k=3 and then flattens, with a modest uptick around k≈6–7. That pattern suggests k=6 gives enough separation to reveal structure without fragmenting into tiny clusters.

  • Referee Crew Clusters (k): 8 clusters. In Figure 9, the CH index keeps rising but shows a clear bend near k≈7–8, so 8 was chosen.
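
A CH curve of the kind described above can be produced by fitting k-means over a range of k and scoring each partition. The sketch below uses scikit-learn on synthetic two-feature data with three planted groups. The data, the standardization step, and the range of k are all assumptions made for illustration and do not reproduce the project's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (foul diff, turnover diff): 27 points in 3 groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 1.0], [-1.0, -3.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(9, 2)) for c in centers])

# Standardizing is an assumption; the report does not state the scaling used.
X_scaled = StandardScaler().fit_transform(X)

ch_scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    ch_scores[k] = calinski_harabasz_score(X_scaled, labels)

for k, score in ch_scores.items():
    print(k, round(score, 1))
```

A higher CH index means tighter, better-separated clusters. The report picks k from bends in the curve rather than from the maximum, which is a judgment call that trades a slightly lower score for more interpretable groups.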

K-Means Results

K-Means clustering was performed twice, once for individual referees and once for referee crew combinations. The features were the average foul difference and the average whistle-turnover difference between away and home teams. PCA plots were created to show how much variance the principal components explain for each group.

The first principal component in both plots acts like a “strict-on-away” axis: it increases when both the average foul difference and the whistle turnover difference increase (Away − Home).

The second component separates turnover-heavy behavior (higher turnover difference than foul difference) from foul-heavy behavior (the reverse). The 2D projections retain most signal (82% of variance for refs and 75% for crews), so positions are meaningful.
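
The clustering and projection steps can be sketched with scikit-learn as follows. Two components of a two-feature matrix would retain 100% of the variance, so the reported 82% and 75% suggest that the fitted matrix had at least one additional column. The sketch therefore adds games officiated as a hypothetical third feature, and all data here are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 27 referees; the third column is an assumption.
rng = np.random.default_rng(1)
n_refs = 27
foul_diff = rng.normal(0.3, 2.0, n_refs)
turnover_diff = 0.3 * foul_diff + rng.normal(0.0, 0.7, n_refs)
games_officiated = rng.integers(1, 25, n_refs).astype(float)
X = np.column_stack([foul_diff, turnover_diff, games_officiated])

X_scaled = StandardScaler().fit_transform(X)

# Cluster in the scaled feature space, then score the partition.
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X_scaled)
sil = silhouette_score(X_scaled, labels)

# Project to two components for plotting and inspect what they capture.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_  # share of variance per component
loadings = pca.components_                 # rows: PCs, columns: features

print("silhouette:", round(sil, 3))
print("explained variance:", explained.round(3), "total:", explained.sum().round(3))
print("loadings:\n", loadings.round(2))
```

A first component whose loadings on both differentials share a sign behaves like the “strict-on-away” axis described above, while opposite-signed loadings on a component separate foul-heavy from turnover-heavy behavior.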

Individual Referee PCA

The 2D PCA projection in Figure 10 shows PC1 accounting for 50.8% of the variation and PC2 for 31.3%, with k-means at k = 6 and a moderate silhouette score of 0.376. The referee population separates into six behavior groups defined by the signs and magnitudes of the Away–Home differentials:

  • Cluster 0: Turnover-heavy on away (n = 4; games ≈ 4.0). Turnover differential clearly positive (+0.75), foul differential near zero (+0.08). This indicates a tendency to penalize ball-control violations on away teams more than personal fouls.

  • Cluster 1: Lenient on away, foul-driven (n = 5; games ≈ 2.2). Foul differential is negative (−1.06) and turnover differential ≈ 0 (+0.04). This implies systematically fewer fouls on away teams.

  • Cluster 2: Near-neutral, higher-volume (n = 7; mean games_officiated ≈ 17.7). Small, positive differentials (fouls ≈ +0.42; turnovers ≈ +0.37 per game). This cluster sits closest to the origin in PCA space and accounts for the bulk of exposure.

  • Cluster 3: Strict on away, foul-driven (n = 3; games ≈ 3.7). Large foul differential (+4.08) with a smaller positive turnover differential (+0.83). This is the strongest away-side tilt in the sample and is carried primarily by fouls.

  • Cluster 4: Lenient on away, turnover-driven (n = 7; games ≈ 5.0). Turnover differential is negative (−0.67) with fouls ≈ 0 (−0.03). This indicates fewer whistle turnovers against away teams.

  • Cluster 5: Extreme lenient outlier (n = 1; games = 1.0). Very large negatives (fouls ≈ −5.0, turnovers ≈ −1.0). Given the single referee and minimal exposure, this cluster should be treated as an outlier rather than a stable pattern.

The PCA space reveals a dominant neutral/moderate cluster with high exposure (cluster 2) and several smaller clusters that exhibit asymmetric tendencies: two “strict-on-away” profiles (turnover-heavy cluster 0; foul-heavy cluster 3) and two “lenient-on-away” profiles (foul-driven cluster 1; turnover-driven cluster 4).

The most extreme leniency (cluster 5) reflects a singleton with very low sample size. Overall, the structure supports interpretable, non-pervasive heterogeneity in officiating behavior concentrated in a few, relatively low-volume groups.

Individual referee model performance: chosen k: 6, silhouette: 0.376

Table 3: Referee clusters — feature means and sizes
   cluster  avg_foul_diff_away_home  avg_turnover_diff_away_home  games_officiated  n
0        0                     0.08                         0.75              4.00  4
1        1                    -1.06                         0.04              2.20  5
2        2                     0.42                         0.37             17.71  7
3        3                     4.08                         0.83              3.67  3
4        4                    -0.03                        -0.67              5.00  7
5        5                    -5.00                        -1.00              1.00  1

Referee Crew PCA

Figure 11 partitions crews into k = 8 clusters (silhouette ≈ 0.44) on the 2D PCA space (PC1 = 46.1% var; PC2 = 29.5% var). The clusters map cleanly to away–home whistle profiles and show greater dispersion on turnover-type calls than on personal fouls.

Strict on away (foul-led):

  • Cluster 4 (n=15): fouls difference ≈ +4.87, TO difference ≈ +0.53

  • Cluster 0 (n=9), balanced strictness: fouls difference ≈ +2.56, TO difference ≈ +2.56

  • Cluster 7 (n=4), very strict: fouls difference ≈ +9.00, TO difference ≈ +4.75

Lenient on away:

  • Cluster 3 (n=12), foul-lenient: fouls difference ≈ −3.58, TO difference ≈ +0.17

  • Cluster 2 (n=7), mild leniency on both: fouls difference ≈ −0.71, TO difference ≈ −0.21

  • Cluster 6 (n=1), extreme turnover-lenient: fouls difference ≈ −1.00, TO difference ≈ −10.00 (clear outlier)

Mixed profiles:

  • Cluster 1 (n=7), foul-heavy on away but turnover-lenient: fouls difference ≈ +0.43, TO difference ≈ −2.29

  • Cluster 5 (n=3), foul-lenient but turnover-heavy on away: fouls difference ≈ −4.00, TO difference ≈ +5.33

The range in turnover differentials (−10 to +5.3) is wider than that for fouls (−4 to +9), which suggests that crew effects are more pronounced for ball-control/violation calls than for personal fouls. The lower end of the turnover range, however, comes from the single-crew Cluster 6, so this comparison leans heavily on one outlier. Most crews occupy a near-neutral diagonal ridge in PCA space, while a small number of clusters exhibit marked strict or lenient tendencies, including one extreme turnover-lenient outlier.

Crew-level model performance: chosen k: 8, silhouette: 0.442

Table 4: Crew clusters — feature means and sizes
   cluster  avg_foul_diff_per_game  avg_to_diff_per_game  games_officiated   n
0        0                    2.56                  2.56               1.0   9
1        1                    0.43                 -2.29               1.0   7
2        2                   -0.71                 -0.21               2.0   7
3        3                   -3.58                  0.17               1.0  12
4        4                    4.87                  0.53               1.0  15
5        5                   -4.00                  5.33               1.0   3
6        6                   -1.00                -10.00               1.0   1
7        7                    9.00                  4.75               1.0   4
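
The dispersion comparison made in the crew discussion can be checked directly from the cluster means in Table 4. The short sketch below copies those means and computes the spread of each differential, once with all clusters and once without the singleton cluster. The variable names are illustrative.

```python
# Cluster means and sizes copied from Table 4 (clusters 0 through 7).
foul_means = [2.56, 0.43, -0.71, -3.58, 4.87, -4.00, -1.00, 9.00]
to_means = [2.56, -2.29, -0.21, 0.17, 0.53, 5.33, -10.00, 4.75]
sizes = [9, 7, 7, 12, 15, 3, 1, 4]


def spread(values):
    """Return max minus min, rounded to two decimals."""
    return round(max(values) - min(values), 2)


foul_range = spread(foul_means)
to_range = spread(to_means)

# Same comparison after dropping clusters that contain a single crew.
foul_range_multi = spread([f for f, n in zip(foul_means, sizes) if n > 1])
to_range_multi = spread([t for t, n in zip(to_means, sizes) if n > 1])

print("all clusters:     fouls", foul_range, "turnovers", to_range)
print("without n=1 crew: fouls", foul_range_multi, "turnovers", to_range_multi)
```

Across all clusters the turnover spread (15.33) exceeds the foul spread (13.0). With the single-crew Cluster 6 excluded, the turnover spread drops to 7.62 while the foul spread stays at 13.0, so the ordering depends on that one observation.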

Conclusion

Using principal component analysis (PCA) of Away–Home whistle differentials and k-means clustering (k = 6 for referees; k = 8 for crews), we find that officiating patterns are concentrated rather than pervasive. The two-dimensional PCA projections capture a meaningful share of variation (≈82% for referees; ≈75% for crews), and the cluster separation is moderate by silhouette score. Both of these support an interpretable structure without overfitting.

At the crew level, most crews lie near neutrality, but a small subset occupies a “strict-on-away” region (e.g., fouls +2.6 to +9; turnovers +0.5 to +4.8). These crews exhibit tendencies that could plausibly amplify home-team win probability, although causal claims require confirmatory modeling. The crews can be separated into four behavioral types: strict on away, foul-heavy on away, turnover-heavy on away, and lenient on away. Dispersion is greater for turnover-whistle differentials than for foul differentials, indicating that crew effects manifest more strongly in ball-control calls (e.g., travels, 3-second violations) than in personal fouls.

At the individual-referee level, most officials cluster around zero on both dimensions, but several outliers are evident: two officials (Ref D and Ref V) display a strict-on-away profile; Ref B is foul-heavy on away; and Ref P is lenient on away. Observed asymmetries are driven by a small set of actors rather than the referee population at large.

Overall, these findings suggest crew-specific and referee-specific tendencies rather than league-wide bias. Given data limitations (the dataset omits portions of a season and the current season), future work should extend coverage and test these patterns with inferential models (e.g., mixed-effects or hierarchical regressions of home wins and foul/turnover differentials with crew/ref effects and game-context controls) to establish robustness and practical impact.

References

[1] WNBA Play-By-Play Dataset [https://www.kaggle.com/datasets/brains14482/nba-playbyplay-and-shotdetails-data-19962021/data]

[2] PCA Break Out [https://www.datacamp.com/tutorial/principal-component-analysis-in-python]

[3] K-Means and PCA [https://365datascience.com/tutorials/python-tutorials/pca-k-means/]

[4] PCA Features [https://drlee.io/the-ultimate-step-by-step-guide-to-data-mining-with-pca-and-kmeans-83a2bcfdba7d]

[5] MatPlot Bar Labeling [https://www.geeksforgeeks.org/python/adding-value-labels-on-a-matplotlib-bar-chart/]